Bioinformatics Advances
Oxford University Press (OUP)
Preprints posted in the last 90 days, ranked by how well they match Bioinformatics Advances's content profile, based on 184 papers previously published here. The average preprint has a 0.16% match score for this journal, so anything above that is already an above-average fit.
Graham, C. L.; Cremona, L.; Little, R.; Rodrigues, C. D.
Multiple sequence alignment (MSA) data underlie current principles in protein folding and protein-protein interaction prediction, from which large language models (LLMs), in tandem with protein datasets, can predict protein structure. However, what is missing are user-friendly tools that enable researchers to predict and demonstrate coevolution, the principal input that these MSAs infer. Here we present tools to identify and visualize coevolution through a pipeline (CoEVFold) that uses basic direct coupling algorithms derived from GREMLIN and alignment of sequences from MMseqs2. The pipeline generates a visual representation of coevolution for a single protein but can also represent coevolution of homomeric or heteromeric protein complexes, as well as coevolution within protein networks. The input for this pipeline can be an amino acid sequence or user-supplied protein structures from AlphaFold, the user's own files, or the PDB database. In validation of CoEVFold's capabilities, and utilising proteins from known prokaryotic and eukaryotic model systems (Bacillus subtilis, Escherichia coli and Saccharomyces cerevisiae), as well as phage proteins, CoEVFold predicts coevolution between proteins known to interact, proteins known to oligomerise, and coevolution in proteins known to be part of a protein complex. Collectively, this suite of tools, named the CoEVFold suite, has broad applicability and provides a useful toolkit to those interested in dissecting protein-protein interactions and networks. Availability: The code is available online at https://colab.research.google.com/drive/1MSSvNTq7KZ4Lr0XTz89vUuK-J3xOTzwS?usp=sharing and on GitHub at https://github.com/MishterBluesky/CoEVFold. Supplementary information: Supplementary data are available via Figshare and supplementary materials.
Njipouombe Nsangou, Y. A.; Haug, S.; Ulmer, M. A.; Bellur, O.; Römisch-Margl, W.; Dönitz, J.; Köttgen, A.; Arnold, M.; Kastenmüller, G.
Background: Kidney disease refers to a broad range of disorders that impair renal structure and function. Among these, chronic kidney disease (CKD) is the most prevalent worldwide, affecting approximately 10% of the global adult population. While large-scale omics studies have identified numerous molecular associations with kidney function and disease, these insights often remain isolated within individual data layers, hindering a systems-level understanding of the functional interplay between genes, proteins, metabolites and clinical phenotypes. Methods: We developed the Kidney Disease Atlas (KD Atlas) using an extended QTL-based integration strategy combined with a composite network approach. For this purpose, we leveraged results from omics studies in population-based and kidney disease-specific cohorts from the CKDGen Consortium and other large-scale initiatives and integrated them with data from knowledge databases, inferring a comprehensive network of relationships between metabolites, proteins, genes, and kidney disease-related traits. Results: We present the KD Atlas, an online resource (https://metabolomics.helmholtz-munich.de/kdatlas) integrating over 25 large studies providing disease-relevant information on 20,456 protein-coding genes, 1,962 proteins, 1,375 metabolites and 40 kidney disease phenotypes connected by more than 1.2 million relationships. Through an interactive web interface, researchers can dynamically construct context-specific molecular subnetworks and perform integrated analyses without requiring specialized bioinformatics expertise. Application showcases demonstrate the resource's utility for providing the molecular context of KD-associated genes or metabolites and for generating novel mechanistic hypotheses. Conclusion: The KD Atlas provides a global, multi-omics network view of kidney pathophysiology through an intuitive interface, empowering researchers to formulate mechanistic hypotheses and prioritize candidate targets for subsequent experimental validation.
Jin, Y.; Sverchkov, Y.; Sushkova, A.; Ohtake, M.; Emfinger, C.; Craven, M.
Motivation: Large-scale gene knockdown/knockout screens have been used to gain insight into a wide array of phenotypes and biological processes. However, conducting such experiments is expensive and labor-intensive. In this work, we present a general graph-based machine-learning approach that can predict the effects of gene perturbations on molecular phenotypes of interest given some measured phenotypic effects of other gene perturbations. The motivation for learning models that can predict the effects of gene perturbations is fourfold. Such models can (1) predict effects for unmeasured genes in cases in which cost or technical barriers preclude perturbing every gene, (2) prioritize unmeasured genes or sets of genes for subsequent perturbation experiments, (3) hypothesize mechanisms that underlie the relationships between the perturbed genes and their effects, and (4) generalize to other unmeasured phenotypes of interest. Results: We evaluate our approach by applying it, in conjunction with four different learning methods, to learn models for four varied phenotypes. Our empirical evaluation demonstrates that the learned models (1) show relatively high levels of predictive accuracy across the four phenotypes, (2) have better predictive accuracy than several standard baselines, (3) can often learn accurate models with small training sets, (4) benefit from having multiple sources of evidence in the input representation, and (5) can, in many cases, transfer their predictive value to other phenotypes. Availability and Implementation: The assembled datasets and source code for this work are available at: https://github.com/Craven-Biostat-Lab/graph-molecular-phenotype-prediction
Jadamba, E.; Lee, S.-H.; Hong, J.; Lee, H.; Lee, S.; Shin, H.
Summary: Recent advancements in artificial intelligence (AI) have led to the development of foundation models that interpret mRNA as a language. Notable examples include CodonBERT, hydraRNA, EVO2, and Helix-mRNA. These models demonstrate significant potential as powerful tools for mRNA research. However, to the best of our knowledge, there is currently no publicly available AI model that is both easy to use and capable of analyzing mRNA sequences up to about 4kb, a length scale typical of many therapeutic mRNAs, including those encapsulated within lipid nanoparticles (LNPs). Thus, we propose CDS-BART, a user-friendly, open-source tool that integrates SentencePiece sub-word tokenization with the denoising sequence-to-sequence training of Bidirectional and Auto-Regressive Transformers (BART). CDS-BART was pre-trained on mRNA data from nine taxonomic groups provided by the NCBI RefSeq database. This comprehensive pre-training, coupled with BART's denoising capability, ensures effective learning of codon usage, mRNA structure, evolution, and regulation. Thus, CDS-BART can ultimately deliver robust performance across a wide range of mRNA prediction tasks. Availability and Implementation: CDS-BART is released under the MIT License. The latest code is available via GitHub at https://github.com/mogam-ai/CDS-BART.
Mahlich, Y.; Ross, D. H.; Monteiro, L.; McDermott, J. E.
Motivation: Despite continuing advances in sequencing and computational function determination, large parts of the studied gene, protein, and metabolite space remain functionally undetermined. Most function assignment is driven by homology searches and annotation transfer from known and extensively studied proteins but often fails to leverage available experimental omics data generated via technologies like mass-spectrometry. Results: The VaLPAS (Variation-Leveraged Phenomic Association Screen) framework is available as a Python package and provides a user-friendly platform for calculation of associations between expression patterns of genes or proteins in multi-omic datasets based on various statistical and learning methods. The goal of this approach is to shed light on the functional dark matter of protein space by elucidating previously unknown functions of molecules using guilt by association with molecules of known function. We present results demonstrating the utility of VaLPAS to identify high-confidence predictions for a subset of genes/proteins of unknown function in a previously published multi-omics dataset from the oleaginous yeast, Rhodotorula toruloides. Availability: VaLPAS is written in Python. The code is hosted on GitHub (https://github.com/PNNL-Predictive-Phenomics/valpas/).
Sailem, H.; Jamdade, S.
Small interfering RNAs (siRNAs) provide a promising therapeutic approach capable of selectively silencing disease-associated genes; however, achieving high efficacy and specificity while minimising off-target effects remains a significant challenge. Endogenous small RNAs, such as microRNAs (miRNAs) and PIWI-interacting RNAs (piRNAs), exhibit structural features supporting their functions and are biocompatible. Recent advances in RNA foundation models, such as RNA-FM, enable large-scale learning of sequence and structural representations of RNA sequences, offering a powerful framework for studying small RNA functions. Here, we leverage the RNA-FM model alongside interpretable biological features to systematically compare endogenous small RNAs (miRNAs and piRNAs) with synthetic siRNAs. Biological features highlighted class-specific patterns: piRNAs showed higher GC content and melting temperature than miRNAs and siRNAs, suggesting higher stability. The bias of synthetic siRNAs towards adenine is consistent with design rules aimed at reducing secondary structure formation. Importantly, we mapped RNA-FM embeddings to interpretable features to better understand deep learning outputs and facilitate effective extraction of functionally relevant information. To support RNA exploration, we implemented these functionalities in RNAExplorer (www.rnaexplorer.com), a web-based application that allows analysing and visualising small RNA features interactively. Together, our integrative analysis provides a framework for understanding small RNA biology and improving siRNA therapeutic design strategies.
Napoli, A.; Liorni, N.; Biagini, T.; Giovannetti, A.; Squitieri, A.; Miele, L.; Urbani, A.; Caputo, V.; Gasbarrini, A.; Squitieri, F.; Mazza, T.
Short tandem repeat expansions in exon 1 of the HTT gene drive Huntington's disease (HD) pathogenesis, with disease onset and progression heavily influenced by somatic mosaicism and sequence interruptions. While sequencing technologies enable repeat sizing, many computational tools lack the resolution to capture subtle interruption motifs and allele-specific somatic variation. We present STRmie-HD, an alignment-free, de novo framework for interruption-aware genotyping and quantitative profiling of somatic mosaicism at single-read resolution. The tool parses individual reads to quantify uninterrupted CAG tract length, CCG repeat content, and critical interruption variants, including Loss of Interruption (LOI) and Duplication of Interruption (DOI). Validated across Illumina, PacBio SMRT, and Oxford Nanopore platforms, STRmie-HD demonstrates high concordance with reference genotypes and superior sensitivity in identifying rare interruption patterns that conventional tools often overlook. Furthermore, it implements somatic mosaicism metrics to characterize repeat dynamics, successfully distinguishing the higher somatic expansion burden in brain tissues compared to peripheral blood. STRmie-HD offers a comprehensive and extensible solution for high-resolution molecular characterization of HTT variation, providing a robust framework for patient stratification and genetic research in HD.
Graphical Abstract: STRmie-HD flowchart. STRmie-HD is a comprehensive analytical framework that processes sequencing reads to analyze CAG/CCG trinucleotide repeats, interruption variants, and somatic mosaicism in the HTT gene. The workflow begins with sequencing reads (FASTA/FASTQ) that can undergo optional custom processing based on the sequencing design. These reads are then fed into a regular expression-based engine (STRmie-HD) to identify CAG and CCG motifs. The identified motifs lead to the estimation of CAG/CCG alleles, visualized as distinct peaks representing different allele sizes, interruption variant assessment, and somatic mosaicism quantification. STRmie-HD produces an HTML output that wraps this information into a report.
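The regular expression-based motif search described above can be illustrated with a minimal Python sketch. The patterns, function names, and the toy read below are assumptions made for illustration only; the abstract does not give STRmie-HD's actual expressions, and this sketch does not classify LOI or DOI variants.

```python
import re

# Hypothetical patterns for illustration; the tool's real expressions may differ.
CAG_RUN = re.compile(r"(?:CAG)+")
CCG_RUN = re.compile(r"(?:CCG)+")


def longest_run(pattern, read):
    """Return the repeat count of the longest uninterrupted run matched by pattern."""
    best = 0
    for match in pattern.finditer(read):
        best = max(best, len(match.group(0)) // 3)
    return best


def profile_read(read):
    """Summarise one read by its longest CAG tract and longest CCG tract."""
    read = read.upper()
    return {"cag": longest_run(CAG_RUN, read), "ccg": longest_run(CCG_RUN, read)}


# Toy read: five CAG repeats, a CAACAG interruption, CCGCCA, then three CCG repeats.
read = "ATG" + "CAG" * 5 + "CAACAG" + "CCGCCA" + "CCG" * 3 + "CCT"
profile = profile_read(read)
```

Because the search runs on each read independently, per-read counts like these could be tallied into a histogram whose peaks suggest allele sizes, in the spirit of the flowchart description. Note that this naive search is not reading-frame aware.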
Li, J.; Wang, J.; Dokholyan, N. V.
Due to the limited resolution of experimental data, many determined RNA structures contain physically implausible geometries, such as severe steric clashes and missing atoms. Resolving these defects during RNA structure refinement remains a fundamental challenge. Structure dictates the function, so the geometric accuracy of RNA structure is critical for understanding biological mechanisms. However, traditional algorithms for correction have limitations because of the complexity of RNA structures. We propose ChironRNA, an all-atom diffusion model with E(3)-equivariant graph neural networks to perform RNA refinement by resolving steric clashes and completing missing atoms. In ChironRNA, we adopt a hierarchical approach, including both an all-atom diffusion model and a coarse-grained diffusion model where each nucleotide is represented by a five-point representation. Our pipeline consists of two stages: a training stage and a generation stage. The diffusion model regenerates clashing nucleotide atoms step by step by removing the noise predicted by EGNN. ChironRNA achieves an 80% clash reduction on more than 80% of the test set. It performs better on structures of less than 200 nucleotides, resulting in a high percentage of cases having over 80% clash reduction rate and 100% atom reconstruction rate. Our results demonstrate that ChironRNA successfully resolves steric clashes and rebuilds missing atoms with high precision, offering a robust solution where traditional fine-tuning or enumerative approaches fail.
Ouso, D.; Pollastri, G.
Deep learning (DL) has advanced computational genome annotation tasks such as protein sub-cellular localisation (SCL) prediction. Nonetheless, its potential remains underutilised, primarily because of the limited availability of high-quality reference data and suboptimal input preparation strategies. In this study, we develop and analyse a high-quality dataset derived from the latest release of the universal protein knowledgebase (UniProtKB), designed to address existing challenges and support robust DL-based SCL modelling. The dataset was constructed through extensive quality preprocessing to ensure reliability, manual label mapping to enhance the quantity and diversity of the training data, and stringent partitioning to minimise data leakage. We validated the dataset using independent test sets, achieving up to 10.8% performance improvement, measured by the area under the precision-recall curve (PR-AUC), compared to the state-of-the-art (SoTA). Furthermore, we highlighted potential performance metric inflation in existing SoTA predictors by demonstrating, for the first time, at least 4.8% training-to-testing data leakage (pre-sequence representation) when using only 10% of the training set under homology augmentation (augmentation based on sequence similarity database searches; details in Sub-section 2.1), a commonly used data augmentation strategy in DL-based SCL prediction modelling. SCL2205 will efficiently support the development of robust, trustworthy, and generalisable DL-based SCL predictors, while minimising data leakage and promoting reproducibility. It is openly available under the Creative Commons Zero (CC0 1.0) licence on DRYAD and is conveniently deployed as a package on the Python Package Index - p-scldata.
Li, L. X.; Zhang, Y.; Aguilar, B.; Shah, T.; Gennari, J.; Qin, G.
Motivation: Logical gene regulatory network (GRN) models provide interpretable, mechanistic representations of cellular regulation and are widely used in systems biology. However, most existing models remain incomplete, context-specific, and difficult to extend to comprehensive GRNs, limiting their broader applicability to tasks such as drug-response prediction and precision medicine. Results: We present KGBN (Knowledge Graph-augmented Boolean Network modeling), a computational workflow for systematically augmenting logical GRN models. KGBN incorporates regulatory interactions derived from curated knowledge graphs as alternative logical rules while preserving the validated structure of existing models. Rule probabilities are optimized against experimental data to represent regulatory uncertainty and achieve data-driven calibration. Applying KGBN to acute myeloid leukemia, we show that extending an existing GRN with drug-target pathways and training against ex vivo drug-response data yields mutation-specific models that recapitulate known therapeutic sensitivities and signaling dependencies, demonstrating the utility of KGBN for interpretable, context-aware GRN modeling. Availability and implementation: KGBN is available at https://github.com/IlyaLab/KGBN.
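The idea of attaching probabilities to alternative logical rules can be sketched as a small probabilistic Boolean network. Everything below (the three-gene network, the rule probabilities, the synchronous update scheme, and the function names) is a hypothetical illustration and is not taken from the KGBN code; in particular, the probabilities are fixed by hand here rather than optimized against data.

```python
import random

# Hypothetical three-gene network; each node maps to (probability, rule) alternatives.
RULES = {
    "A": [(1.0, lambda s: s["A"])],                  # input node keeps its value
    "B": [(0.7, lambda s: s["A"]),                   # rule from the original model
          (0.3, lambda s: s["A"] and not s["C"])],   # alternative rule, e.g. from a knowledge graph
    "C": [(1.0, lambda s: not s["B"])],
}


def pick_rule(alternatives, rng):
    """Sample one rule according to its probability."""
    threshold = rng.random()
    cumulative = 0.0
    for probability, rule in alternatives:
        cumulative += probability
        if threshold < cumulative:
            return rule
    return alternatives[-1][1]  # guard against floating-point rounding


def step(state, rng):
    """Synchronous update: every node is recomputed from the previous state."""
    return {node: bool(pick_rule(alts, rng)(state)) for node, alts in RULES.items()}


def mean_activity(initial, steps, seed=0):
    """Average on-fraction of each node over a simulated trajectory."""
    rng = random.Random(seed)
    state = dict(initial)
    totals = {node: 0 for node in RULES}
    for _ in range(steps):
        state = step(state, rng)
        for node, value in state.items():
            totals[node] += value
    return {node: totals[node] / steps for node in RULES}


activity = mean_activity({"A": True, "B": False, "C": False}, steps=1000)
```

In a calibration setting, simulated averages such as `activity` could be compared with measured readouts and the rule probabilities adjusted to reduce the mismatch; that optimization step is not shown here.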
Faulon, J.-L.; Dursoniah, D.; Ahavi, P.; Raynal, A.; Asin-Garcia, E.
Summary: This study presents dAMN, a hybrid neural-mechanistic model that integrates neural networks with genome-scale dynamic flux balance analysis (dFBA) to predict bacterial growth curves across diverse nutrient environments. dAMN uses neural networks to infer dynamic behavior from initial metabolite concentrations, while mechanistic constraints ensure stoichiometric and thermodynamic consistency based on genome-scale metabolic models. dAMN is trained on E. coli and P. putida experimental growth data from media containing various combinations of sugars, amino acids, and nucleobases, and evaluated on two test sets: one for forecasting over time and another for predicting growth dynamics on unseen media. dAMN achieved high predictive power (R² ≥ 0.9), successfully reproducing growth and substrate depletion dynamics including acetate overflow and glucose-acetate consumption shift for E. coli. An interesting innovation of dAMN is the treatment of the lag phase, enabling realistic adaptation dynamics absent from standard dFBA models. dAMN stands out for its ability to generalize across combinatorial nutrient inputs and produce full growth-curve predictions from minimal input data. Availability and implementation: The dAMN software, along with the associated models and data, is available at https://github.com/brsynth/dAMN-main-release and via DOI 10.5281/zenodo.17908125
Troitino-Jordedo, D.; Mansouri, A.; Minebois, R.; Querol, A.; Remondini, D.; Balsa-Canto, E.
Context-specific genome-scale metabolic models are critical tools for studying cellular metabolism under dynamic conditions. However, most existing methods for deriving these models are designed for steady-state settings and may fail to preserve reactions required for transient metabolic shifts, thereby limiting their compatibility with dynamic FBA. Here, we present GeNETop, a methodology for deriving context-specific GEMs designed to preserve dynamic compatibility. GeNETop integrates flux variability analysis (FVA), network topology metrics based on the Integrated Value of Influence (IVI), and transcriptomic data to identify reactions that are both flux-flexible and structurally influential. Reactions are prioritized based on variability and maximality indices, while topology and gene expression guide further refinement, reducing dependence on fixed expression thresholds. Using batch fermentation of Saccharomyces cerevisiae as a case study, we evaluate GeNETop against established methods for context-specific metabolic reconstruction. The resulting networks remain dynamically feasible across growth phases, capture key metabolic transitions, reduce non-essential reactions, and maintain computational tractability. Overall, GeNETop enables context-specific metabolic reconstructions that are compatible with dynamic simulations while maintaining computational efficiency. By overcoming key limitations of existing approaches, the method supports a more accurate representation of time-dependent metabolic processes in biotechnology and systems biology.
Author summary: Cellular metabolism relies on complex networks of reactions to process nutrients, generate energy, and build essential compounds for biomass. Context-specific metabolic models aim to represent only the reactions active under a given condition, improving biological realism and reducing computational complexity in flux balance analysis simulations.
However, metabolic activity adapts dynamically to changing environmental conditions, and reactions that are inactive at one stage may become essential at another. Many current reconstruction methods are designed for steady-state conditions and may exclude reactions that are required during metabolic transitions, thereby limiting their ability to describe dynamic behavior. Here, we introduce GeNETop, a novel approach that refines context-specific networks by integrating multiple layers of information. GeNETop identifies the most relevant reactions by considering their flexibility, importance within the network topology, and gene activity levels. In this way, the method generates biologically meaningful models that focus on metabolic pathways relevant under dynamic conditions. We tested GeNETop on yeast fermentation, a key process in food and biofuel production. The resulting models capture metabolic changes over time and enable stable dynamic simulations, supporting improved flux balance analysis of time-dependent metabolic processes.
Kiselev, V. Y.; Ainscow, E.
Knowledge graphs (KGs) have become an important asset in biomedical research and drug discovery by enabling the structured integration of heterogeneous biological knowledge. When combined with machine learning (ML), KGs support the identification of novel drug-target relationships, but existing approaches are often KG-centric, relying primarily on graph structure and embeddings while overlooking disease-specific biological and clinical context. Moreover, many high-impact applications depend on proprietary KG infrastructures, limiting accessibility for the broader research community. Here, we introduce Artemis, a practical and generalisable machine-learning framework for indication-aware target prioritisation that integrates public biomedical KGs with clinical evidence from the ChEMBL database. Artemis derives graph-based representations of clinically validated drug targets from multiple publicly available KGs and augments them with disease-relevant clinical features from ChEMBL. This hybrid feature space is used to train supervised ML models across seven disease indications, with performance assessed via cross-validation and guided parameter optimisation. The framework is further evaluated on emerging breast cancer targets reported at the San Antonio Breast Cancer Symposium 2024, demonstrating its ability to prioritise novel candidates. Overall, this work demonstrates that publicly available KGs can be used for actionable, translational target discovery when coupled with clinical data. Artemis provides an accessible, scalable, and cost-efficient alternative to proprietary KG platforms, thereby offering a practical solution for researchers seeking to prioritise therapeutic targets in real-world drug discovery settings.
Key Points:
- KG applications can support the identification of novel drug-target relationships but rely primarily on graph structure while overlooking disease-specific biological and clinical context.
- Artemis performs indication-aware target prioritisation that integrates public biomedical KGs with clinical evidence from the ChEMBL database.
- Artemis is evaluated on emerging breast cancer targets reported at the San Antonio Breast Cancer Symposium 2024, demonstrating its ability to prioritise novel candidates.
- Artemis provides an accessible, scalable, and cost-efficient alternative to proprietary KG platforms, offering a practical solution for researchers seeking to prioritise therapeutic targets in real-world drug discovery settings.
Tang, T.; Shen, T.; Li, W.; Chen, Y.; Yuan, S.; Liu, Y.; Yang, X.; Luo, X.
Protein-protein interactions (PPIs) are governed by two fundamental interfacial mechanisms: similarity-driven, often involving symmetric structural motifs, and complementarity-driven, arising from geometric and physicochemical matching between binding surfaces. Despite their biological significance, computational models have largely overlooked the coexistence and interplay of these twofold interaction modes. Here, we introduce DMG-PPI, a dual-channel graph neural network framework that jointly models similarity and complementarity in PPI networks, extending prior heterophilous GNN concepts to explicitly disentangle these dual interaction modes. The model consists of two parallel message-passing pathways: Alignment Message Passing (AMP), which aggregates signals from similar neighbors to capture symmetric interfaces, and Divergence Message Passing (DMP), which contrasts node features to extract complementary binding patterns. The signals captured by AMP and DMP are integrated via an adaptive fusion strategy within each block, and the outputs of blocks are aggregated using the MixHop framework to encode higher-order interaction patterns. DMG-PPI substantially outperforms state-of-the-art methods on classical benchmark datasets, achieving a 7.19% improvement in Micro-F1 over the second-best method. Additionally, the dual-channel framework provides interpretable insights into key binding residues by identifying interfacial mechanisms. Overall, DMG-PPI serves as a powerful tool that reveals the mechanisms behind accurate PPI predictions and facilitates downstream biological analysis.
Author summary: Proteins interact through multiple interfacial principles rather than a single uniform mechanism, reflecting diverse modes of molecular recognition between binding partners. Accurately modeling this heterogeneity remains a central challenge in protein-protein interaction (PPI) prediction.
Existing computational approaches often overlook such complexity and treat all interactions as arising from a dominant pattern. To address this limitation, we present DMG-PPI, a dual-channel graph neural network framework designed to account for distinct interaction principles within a unified model. By modeling both similarity-based and complementarity-based interaction signals, DMG-PPI enables a more comprehensive representation of interaction patterns in PPI networks. Evaluations on widely used benchmark datasets demonstrate that our approach achieves robust and consistent improvements over existing methods. Beyond prediction accuracy, DMG-PPI provides biologically meaningful interpretability by highlighting key residues and revealing distinct interfacial interaction mechanisms, offering valuable insights for downstream structural and functional studies.
Kanchwala, M. S.; Xing, C.; Xuan, Z.
Genome-wide association studies (GWAS) have significantly advanced our understanding of complex traits and diseases, but their interpretive power remains limited due to challenges in identifying causal genes and pathways. Integrating GWAS with multi-omics data--such as gene expression, protein-protein interactions, and gene-pathway networks--has the potential to enhance biological insights and improve gene prioritization. To fulfill this potential and need, we developed the GWAS & Multi-omics Integration Pipeline (GMIP), a flexible and scalable framework that incorporates widely used tools such as PoPS, MAGMA, and benchmarker to enrich GWAS findings. However, PoPS suffers from multicollinearity in its features, which can impact performance. To overcome this, we introduce GMIP-PLSR, an extension of GMIP that uses Partial Least Squares Regression (PLSR) to manage multicollinearity effectively. We applied GMIP-PLSR across multiple GWAS datasets, demonstrating superior performance over PoPS in most cases. In a case study on NAFLD, GMIP-PLSR, using features derived from both disease-specific scRNA-seq and general PoPS features, identified gene sets with higher heritability and stronger enrichment in known NAFLD pathways, confirming its ability to enhance GWAS findings. Built on Nextflow, GMIP is computationally efficient, adaptable to diverse research environments, and provides a robust solution for gene reprioritization in post-GWAS analyses. GMIP-PLSR is available at https://github.com/mohammedmsk/GMIP.
Chen, Y. G.; Chung, W.-Y.; Chang, K. Y.
Accurate protein subcellular localization is essential for biological function, and mislocalization is linked to numerous diseases. While current methods like DeepLoc 2.0 employ lightweight fine-tuning of protein language models (PLMs), their ability to predict multi-compartment localization remains limited. To address this, we introduce DualLoc, a multi-label localization predictor for ten compartments. DualLoc leverages full-parameter fine-tuning of a cascaded dual-transformer architecture, built upon foundational PLMs and augmented with attention and dropout layers. We evaluated this framework using three foundational PLMs--ProtBERT, ESM-2, and ProtT5--as backbones. Cross-validation on Swiss-Prot and independent validation on the Human Protein Atlas demonstrate consistent superiority over state-of-the-art baselines. The best-performing variant, DualLoc-ProtT5, achieves 0.5872 accuracy, 0.8271 micro-F1, and 0.7811 macro-F1, with substantial gains in the Matthews correlation coefficient for the nucleus (+0.13), cell membrane (+0.13), and extracellular space (+0.07). Pointwise mutual information analysis of model outputs reveals biologically relevant compartment couplings, notably between the Golgi apparatus and endoplasmic reticulum (PMI = 0.25, P < 10^-6), accurately reflecting secretory pathway coordination. DualLoc provides both a highly accurate predictive tool and a robust framework for investigating protein multi-localization mechanisms.
Author summary: Where a protein resides within a cell determines what it does. When proteins end up in the wrong location, normal cellular function breaks down--a misplacement linked to diseases like cancer and Alzheimer's. While computational tools exist to predict these locations, accurately tracking proteins that multitask across multiple cellular compartments simultaneously remains a major challenge.
We developed DualLoc, a new approach that predicts protein locations across ten different cellular compartments, from the nucleus to the cell membrane. By training an advanced artificial intelligence model on large protein sequence databases, our method more accurately identifies where proteins go, especially in complex, multi-location scenarios. Importantly, our analysis revealed meaningful biological patterns. We found strong predictive links between compartments that work closely together, such as the Golgi apparatus and the endoplasmic reticulum--two organelles that coordinate protein processing and transport. This suggests our model captures genuine cellular logic rather than simply memorizing data. By improving how we predict protein localization, DualLoc helps researchers better understand normal cellular function and disease mechanisms. Our method is freely available to the biomedical community.
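The pointwise mutual information (PMI) analysis of compartment couplings mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the toy predictions, the function name `pmi`, and the choice of log base 2 are not taken from the DualLoc paper, and no significance test is included.

```python
import math


def pmi(labels, a, b):
    """Pointwise mutual information (log base 2) between compartments a and b.

    labels: list of sets, one set of predicted compartments per protein.
    """
    n = len(labels)
    p_a = sum(a in row for row in labels) / n
    p_b = sum(b in row for row in labels) / n
    p_ab = sum(a in row and b in row for row in labels) / n
    if p_ab == 0:
        return float("-inf")  # the two compartments never co-occur
    return math.log2(p_ab / (p_a * p_b))


# Toy multi-label predictions for four hypothetical proteins.
predictions = [
    {"Golgi", "ER"},
    {"Golgi", "ER"},
    {"Nucleus"},
    {"Nucleus", "Cytoplasm"},
]
score = pmi(predictions, "Golgi", "ER")
```

A positive value means the two compartments are predicted together more often than independence would suggest, zero means no association, and negative values indicate avoidance.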
Lourenco, V. M.; Ogutu, J. O.; Piepho, H.-P.
Show abstract
Data contamination--from recording errors to extreme outliers--can compromise statistical models by biasing predictions, inflating prediction errors, and, in severe cases, destabilizing performance in high-dimensional settings. Although contamination can affect responses and covariates, we focus on response contamination and evaluate Random Forests through simulation. Using a synthetic animal-breeding dataset, we assess robust Random Forests across several contamination scenarios and validate them on plant and animal datasets. We thereby clarify the consequences of contamination for prediction, develop a robust Random Forest framework, and evaluate its performance. We examine preprocessing or data-transformation strategies, algorithmic modifications, and hybrid approaches for robustifying Random Forests. Across these approaches, data transformation emerges as the most effective strategy, delivering the strongest performance under contamination. This strategy is simple, general, and transferable to other Machine Learning methods, offering a remedy for robust genomic prediction. In real breeding data, robust Random Forests are useful when substantial contamination, phenotypic corruption, misrecording, or train-deployment mismatch is plausible and the goal is to recover a latent signal for genomic prediction and selection; ranking-based robust Random Forests are the dependable first option, whereas weighting-based Random Forests should be used only when their weighting scheme preserves rank structure and improves prediction. Robustification is not universally necessary, but it becomes important when contamination distorts the link between observed responses and the predictive target; standard Random Forests remain the default for clean data, whereas robust Random Forests should be fitted alongside them whenever contamination is plausible, with the final choice guided by data, trait, and breeding objective. 
Author summary: Machine learning (ML) methods are widely used for prediction with high-dimensional, complex data, and supervised approaches such as Random Forests (RF) have proved effective for genomic prediction (GP) and selection. Yet their performance can be severely compromised by data contamination if the algorithms rely on classical data-driven procedures that are sensitive to atypical observations. Robustifying ML methods is therefore important both for improving predictive performance under contamination and for guiding their practical use in high-dimensional prediction problems. To address this need, we develop robust preprocessing, algorithm-level, and hybrid strategies for improving RF performance with contaminated data. Using simulated animal data, we show that ranking- and weighting-based robust RF provide the strongest overall compromise for genomic prediction and selection under contamination. Validation on several plant and animal breeding datasets further shows that the benefits of robustification are not universal, but depend on the dataset, trait, and breeding objective. Although motivated by RF, the framework we propose is general, practical, and readily transferable to other ML methods. It also offers a basis for deciding when robustness should complement standard RF rather than replace it outright.
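As a rough illustration of why a ranking-based transformation of the response limits the influence of contaminated values, the sketch below replaces each phenotype with its average rank before any model is fitted. The abstract does not specify the exact transformation the authors use, so `average_ranks` and the toy phenotypes are assumptions; fitting the Random Forest itself is omitted.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


# Hypothetical phenotypes; 250.0 mimics a recording error.
phenotypes = [2.1, 2.4, 250.0, 1.9, 2.4]
ranked = average_ranks(phenotypes)
```

After the transformation the outlier is simply the largest rank (5.0), so its distance from the rest of the data no longer dominates a squared-error split criterion, while the ordering of individuals needed for selection is preserved.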
Deolankar, S.; Wermeling, F.
Show abstract
CRISPR screen data provides a valuable resource for understanding gene function and identifying potential drug targets. Here, we present Correlate, a freely accessible web application (https://correlate.cmm.se) that enables exploration of the Cancer Dependency Map (DepMap) CRISPR screen gene effects, hotspot mutations, and translocation/fusion data across more than 1,000 human cancer cell lines. The application supports two main use cases: (i) analysis of user-defined gene sets (e.g. CRISPR screen hits) to identify functionally linked genes based on correlations while providing an overview based on essentiality or user-provided screen statistics; and (ii) exploration of genes of interest in defined biological contexts, such as specific cancer types or mutational backgrounds, to generate hypotheses about gene function and dependencies. Additionally, Correlate supports experimental design by providing rapid overviews of gene essentiality and enabling the identification of cell lines with relevant mutational profiles. In contrast to knowledge-based approaches such as STRING and GSEA, which rely on prior biological annotations and curated interaction networks, Correlate identifies gene connections directly from functional CRISPR screen readouts, offering a complementary and data-driven perspective on gene network analysis. The application runs entirely in the browser, requires no installation or login, and integrates with the Green Listed v2.0 tool family for custom CRISPR screen design.
Highlights
■ Interactive web-based platform for bulk correlation analysis of user-defined gene sets using DepMap CRISPR screen data, requiring no installation or programming expertise.
■ Identifies functional gene relationships from CRISPR screen readouts rather than curated annotations, offering a data-driven complement to tools such as GSEA and STRING.
■ Enables contextual exploration of gene dependencies across cancer types and mutational backgrounds, supporting hypothesis generation about gene function and therapeutic targets.
■ Supports experimental design through gene essentiality overviews, mutation and fusion analysis, and cell line identification, with optional integration of user-provided statistics from CRISPR screens, proteomics, or transcriptomics analyses.
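The idea of linking genes through correlated CRISPR gene effects across cell lines can be sketched as follows. The abstract does not say which correlation measure Correlate uses, so Pearson correlation is an assumption here, and the gene names, scores, and `pearson` helper are hypothetical rather than DepMap data or the application's code.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined for a constant profile
    return sxy / math.sqrt(sxx * syy)


# Hypothetical gene effect scores; each list is one gene across four cell lines.
gene_effects = {
    "GENE_A": [-1.0, -0.5, 0.0, -0.8],
    "GENE_B": [-2.0, -1.0, 0.0, -1.6],
    "GENE_C": [0.1, -0.9, -0.2, -0.4],
}

# Rank the other genes by absolute correlation with a query gene.
query = "GENE_A"
partners = sorted(
    ((g, pearson(gene_effects[query], v))
     for g, v in gene_effects.items() if g != query),
    key=lambda item: abs(item[1]),
    reverse=True,
)
```

Genes whose knockout effects rise and fall together across cell lines are candidates for a shared pathway or complex, which is the kind of data-driven link the abstract contrasts with annotation-based tools.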
Delamonica, B.; Bat1K 21-Families Group, ; Larijani, M.; MacCarthy, T.; Davalos, L. M.
Show abstract
Motivation: Several computational gene search tools exist to identify and annotate an ever-growing body of newly sequenced genomes of different species. Many annotation tools, however, fall short when the target species diverges from well-studied model organisms, and when searching for short genes with multiple copies.
Results: We have developed the Exon Targeted Retrieval and Classification Toolbox, ExTRaCT, an automated pipeline to identify any gene exon with conserved structure in novel species genome assemblies. In the use cases presented here, we applied our search tool to 102 bat genomes to find APOBEC3 gene family members. We show that our homolog search algorithm is efficient (run time average of 5 hours for over 100 genomes), works well with reference sequences distantly related to the target (1 out of 498 misclassifications, 0 false positives and 2 false negatives), and is easy to use. As genomic sequencing becomes faster and more accessible, ExTRaCT has downstream applications in phylogenetic, biochemical and genomic studies. It is a simple computational tool that provides a solution to target gene identification, requiring neither whole-genome-assembly annotations, nor prior knowledge of closely related species.
Availability: https://doi.org/10.5281/zenodo.15769018
Contact: Brenda.delamonica@stonybrook.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
Hao, Y.; Kim, Y.; Aggarwal, B.; Sinha, S.
Show abstract
Motivation: Joint Spatial Metabolomics (SM) and Spatial Transcriptomics (ST) profiling is a powerful approach to fine-mapping of metabolic states associated with tissue function. Current computational tools for analysis of "SM+ST" data focus primarily on alignment and integration of the two modalities, with limited support for probing biological relationships between the two molecular layers.
Results: We present MANTIS, a statistical framework for analyzing co-registered SM+ST profiles at single cell or spot resolution, along with spatial domain or cell type information, to discover metabolite spatial patterns and gene-metabolite relationships. It employs an autocorrelation-preserving permutation strategy to assess statistical significance, yielding calibrated inference under spatial dependence. It disentangles different sources of spatial patterns and correlations, viz., those arising from regional preferences, cell type associations, or other unknown factors. It introduces the use of spatial cross-correlation and spatial partial correlation statistics for quantifying gene-metabolite associations. Across data sets spanning different spatial technologies, tissues and species, MANTIS provides more specific and interpretable discoveries than existing methods through rigorous statistical testing and explicitly modeling confounding structure. To our knowledge, MANTIS is the first toolkit to unify spatial metabolomics, spatial transcriptomics, cell type information and spatial domains within a single framework that emphasizes spatial statistics, hypothesis testing and confounder correction.
Availability and Implementation: Freely available on the web at https://github.com/yuhaotuo/MANTIS.